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Abstract 

We describe two slice sampling methods for taking multivariate steps using the crumb framework. These 
methods use the gradients at rejected proposals to adapt to the local curvature of the log-density surface, 
a technique that can produce much better proposals when parameters are highly correlated. We evaluate 
our methods on four distributions and compare their performance to that of a non-adaptive slice sampling 
method and a Metropolis method. The adaptive methods perform favorably on low-dimensional target 
distributions with highly-correlated parameters. 



1 Introduction 



Markov Chain Monte Carlo methods often perform poorly when parameters are highly correlated. Our goal 
has been to develop MCMC methods that work well on such distributions without requiring prior knowledge 
about the distribution or extensive tuning runs. 



Slice sampling ( Neal 2003 ) is an auxiliary- variable MCMC method based on the idea of drawing points 



uniformly from underneath the density surface of a target distribution. If one discards the density coordinate 
from the sample, the resulting marginal distribution is the target distribution. 



This document presents two samplers in the "crumb" framework (Neal 2003 §5.2), a general framework 
for slice sampling methods that take multivariate steps. Unlike many MCMC samplers, they perform well 
when the target distribution has highly correlated parameters. These methods can improve on univariate 
slice sampling in the same way Metropolis with a properly chosen multivariate proposal distribution can 
improve on Metropolis with a spherical proposal distribution and on Metropolis updates of one coordinate 
at a time. 



2 Multivariate Steps in the Crumb Framework 

This section describes the crumb framework for taking multivariate steps in slice sampling. The overall 
goal is to sample from a target distribution, such as the one with log-density contours shown in figure [lja). 
The dimension of the target distribution's parameter space is denoted by p. (The example in the figure has 
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Figure 1: (a) Contours of a target distribution, (b) A slice, S yo , and the current target component, Xq. 



p = 2.) The state space of the Markov chain for slice sampling has dimension p + 1, of which p correspond 
to the target distribution and one to the current slice level. 

Slice sampling alternates between sampling the target coordinates and the density coordinate. Let f(x) 
be proportional to the density function of the target distribution, and let (xo,yo) be the current state in 
the augmented sample space, where xq € W and yo € M. To update the density component, y$, we sample 
uniformly from [0, f(xo)]- Updating the target component, Xq, is more difficult. Let S Vo be the slice through 
the distribution at level yo~. 

S V0 = {x : f(x) > y } (1) 

The set S Vo is outlined by a thick line in figure [TJb). The difficulty in updating the target component is that 
we rarely have an analytic expression for the boundary of S yo , which may not even be a connected curve, so 
we cannot sample uniformly from S yo as we would like to. The methods of this document instead perform 
updates that leave the uniform distribution on S yo invariant, leaving the resulting chain with the desired 
stationary distribution. 

These methods begin by sampling a "crumb," c\, from a distribution that depends on xq, then drawing 
a proposal, x±, from the distribution of points that could have generated c%, treating the crumb as data and 
xq as an unknown. An example crumb and proposal are drawn in figure |2ja) , with the distribution of c\ 
shown as a dashed line and the distribution of x\ shown as a dotted line. 

If x\ were inside S yo , we would accept it as the new value for the target component of the state. In 
figure |2ja), it is not, so we draw a new crumb, C2, and draw a new proposal, X2, from the distribution of 
Xq given both c\ and C2 as data. This is plotted in figure [2jb) . While the distribution of proposed moves is 
determined by the crumbs and their distributions, we can choose the distribution of the crumbs arbitrarily, 
using the densities and gradients at rejected proposals if we desire. 

In the example of figure [2jb), X2 is in S yo , so we accept X2 as the new target component. If it were 
not, we would keep drawing crumbs, adapting their distributions so that the proposal distribution would 
be as close as possible to uniform sampling over S yo , and metaphorically following these crumbs back to Xq 



by drawing proposals from the distribution of xq conditional on having observed the crumbs (Grimm and 



Grimm 2008 1 . 
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Figure 2: (a) The first crumb, ci, and a proposal, x\. The distribution of c\ is shown as dashed. The 
distribution of %\ is shown as dotted. In this figure and subsequent ones, probability distributions in a 
two-dimensional space are represented by ellipses such that a uniform distribution over the ellipse has the 
same mean and covariance as the represented distribution. This method allows multiple distributions to be 
drawn in the same plot, (b) The second crumb, C2, and a proposal, xi- The distribution of ci is shown 
as dashed. The distribution of x 2 is shown as dotted. The distribution for C2 has been updated using the 
method described in section [4] 



Neal (2003 §5.2) has more information on the crumb framework, including a proof that methods in this 



framework leave the target distribution invariant. 

3 Overview of Adaptive Gaussian Crumbs 

We now specialize the crumb framework to Gaussian crumbs and proposals, using log densities at proposals 
and their gradients to choose the crumb covariances. Without violating detailed balance, the crumb distri- 
bution can depend on these log densities and gradients. We assume that while computing the log density at 
a proposal, we can compute its gradient with minimal additional cost. 

Ideally, the proposal distribution would be a uniform distribution over S yo . To approximate uniform 
sampling over S yo , we attempt to draw a sequence of crumbs that results in a Gaussian proposal distribution 
with the same covariance as a uniform distribution over the slice. 

In both adaptive methods discussed in this document, the first crumb has a multivariate Gaussian 
distribution: 

ci ~ N(x , Wi 1 ) where W x = a c 7 2 / (2) 

The standard deviation of the initial crumb, <t c , is the only tuning parameter for either method that is 
modified in normal use. The distribution for xo given c\ is a Gaussian with mean c\ and precision matrix 
W\, so we draw a proposal from this distribution: 



N(ci,W^) (3) 



If f{x\) is at least yo, then x\ is inside the slice, so we accept x\ as the target component of the next state 
of the chain. This update leaves the density component, yo, unchanged, though it is usually forgotten after 
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a proposal is accepted, anyway. 

When the proposal is not in the slice, we choose a different covariance matrix, Wk+i, for the next crumb, 
so that the covariance of the next proposal will look more like that of uniform sampling over S yo . The two 
methods proposed in this document differ in how they make that choice; sections [4] and [5] describe the details. 

After sampling k crumbs from Gaussians with mean Xq and precision matrices (Wi, . . . , Wu), the distri- 
bution for the fcth proposal (the posterior for xq given the crumbs) is: 

x k ~N(c k ,A^) (4) 

where A fc — Wi-\ h W k (5) 

wdc k =K k 1 {W 1 c 1 + --- + W k c k ) (6) 

If x k £ S yo — that is, if f(x k ) > yo — we accept x k as the target component. Otherwise, we must choose 
a covariance for the distribution of (k + l)th crumb, draw a new proposal, and repeat until a proposal is 
accepted. 



4 First Method: Matching the Slice Covariance 

In the method described in this section, we attempt to find W k +i so that the (k + l)th proposal distribution 
has the same conditional variance as uniform sampling from S yo in the direction of the gradient of log/(-) 
at Xk- This gradient is a good guess at the direction in which the proposal distribution is least like S yo . 
Figure [3ja) shows an example of this. We plotted the gradients of the log density at thirty points drawn 
from the same distribution (shown as a dotted line) as a rejected proposal; most of these gradients point in 
the direction where the proposal variance is least like the slice. Generally, in an ill-conditioned distribution, 
these gradients do not point towards the mode, they point towards the nearest point on the slice. (In 
a well-conditioned distribution, the directions to the mode and to the nearest point on the slice will be 
similar.) 

4.1 Choosing Crumb Covariances 

To compute W k +\, we will estimate the second derivative of the target distribution in the direction of the 
gradient, assuming the target distribution is approximately locally Gaussian. Consider the (approximately) 
parabolic cut through the log-density surface in figure [3](b) , which has the equation: 

I = -i K i 2 + /3i + 7 (7) 

t is a parameter that is zero at x k and increases in the direction of the gradient; k, f3, and 7 are unknown 
constants we wish to estimate. The coefficient — | is arbitrary; this choice makes ft equal to the negative of 
the second derivative of the parabola. We already had to compute f(x k ) so that we could determine whether 
x k was in S yo ; we assume that Vlog/(a^fe) was computed simultaneously. To fix two degrees of freedom of 
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Figure 3: (a) shows gradients at thirty points drawn from the first proposal distribution (shown as a dotted 
line). These gradients tend to point towards the slice (shown as a thick line), (b) shows a parabolic cut 
through the log-density surface that goes through a rejected proposal, afi, in the direction of the gradient at 



x\, shown as an arrow. u\ is a point on the cut defined by equation 12 (c) adds a parabolic cut through 



the mode in the same direction. These two parabolas have approximately the same second derivative. The 
distance between the two arms of the second parabola when the vertical coordinate is equal to log yo is shown 
as a double-ended arrow, equal to d in equation |16| 



the parabola, we plug these quantities into equation [7] and its derivative with respect to t: 

log/(x fe )--i K -0 2 + /3-0 + 7 = 7 
||Vlog/0r fe )|| = -«• + = 



(8) 
(9) 



We still have one degree of freedom left, so we must evaluate log/(-) at another point on the parabola. 
We choose a point as far away from Xk as Ck is, hoping that this distance is within the range where the 
distribution is approximately Gaussian. Let S be this distance: 



5 = \\x k - Cfcl 



(10) 



Let g be the normalized gradient at Xk'- 



fJ 



Vlog/(x fe ) 



|Vlog/(a*)|| 

Then, the point Uk is defined to be S away from Xk in the direction g: 



(11) 



u k = x k + 8 ■ g 



(12) 



In equation [7j Uk corresponds to t = 5. We evaluate the density at Uk to fix the parabolic cut's third degree 
of freedom, and plug this into equation [7j 



log/K) = -- K( 5 2 +/W + 7 



(13) 
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Solving equations [8j [9j and [13] for k gives 

2 



S 2 



(log/( Ufc ) - \ogf(x k ) - ||VIog/(i fc )ll ' S) (14) 



We can now use k to approximate the conditional variance in the direction <?. If the Hessian is locally 
approximately constant, as it is for Gaussians, a cut through the mode in the direction of V log f(x k ) would 
have the same second derivative as the cut through x k . This second parabola, shown in figure |3jc), has the 
equation: 

i = --nt 2 +M (15) 
M is the log density at the mode. We set t — at the mode, so there is no linear term. For now, assume 



/(•) is unimodal and that M was computed with the conjugate gradient method (Nocedal and Wright 2006 



ch. 5) or some other similar procedure before starting the Markov chain. We solve equation 15 for the 



parabola's diameter d at the level of the current slice, log?/o, shown as a doubled-ended arrow in figure [3jc) : 

d= J*W (16) 

V K 

Since the distribution of points drawn from an ellipsoidal slice, conditional on their lying on that particular 
one-dimensional cut, is uniform with length d, the conditional variance in the direction of the gradient at x k 

2 d 2 2 M-logj/o 

°™ = 12 = 3 ' " k (17) 

With this variance, we can construct a crumb precision matrix that will lead to the desired proposal 
precision matrix. We want to draw a crumb Ck+i so that the posterior of Xq given the k crumbs has a 
variance equal to cr^ +1 in the direction g. Using equation [5j the precision of the proposal given these crumbs 
is: 

A fe+1 = A fe + W k+1 (18) 
If we multiply both sides of equation 18 by g T on the left and g on the right, the left side is the conditional 



precision in the direction g. 

g T A k+1 g = g T (A k + W k+1 )g (19) 

We would like to choose VFfc+i so that this conditional precision is cjjT, ±, so we replace the left hand side of 
equation [19] with that: 

cr^ 1 =g T (A k + W k+1 )g (20) 



As will be discussed in section |4~3} computation will be particularly easy if we choose W k +i to be a scaled 
copy of A k with a rank-one update, so we choose W k+ i to be of the form: 

W k+1 = 0A k + agg T (21) 

a and 6 are unknown scalars. 9 controls how fast the precision as a whole increases. If we would like the 
variance in directions other than g to shrink by 9/10, for example, we choose 6 — 1/9. Since 9 is constant, 
there will be exponentially fast convergence of the proposal covariance in all directions, which allows quick 
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recovery from overly diffuse initial crumb distributions. For this method, 9 is not a critical choice; 9 = 1 is 
reasonable. Substituting equation |2T] into equation |20| gives: 



a-^ = g T (A k + 9A k + agg T )g (22) 

Noting that g T g = 1, we solve for a, and set: 

a = max{<7j^ 1 - (1 + 9)g T A k g, 0} (23) 

a is restricted to be positive to guarantee positive dcfinitencss of the crumb covariance. (By choosing 9 
simultaneously, we could perhaps encounter this restriction less frequently, but we have not explored this.) 



Once we know a, we then compute W k +i using equation 21 
The resulting crumb distribution is: 

Cfc+x-A^o,^) (24) 

After drawing such a crumb, we draw a proposal according to equations [4] to [6j accepting or rejecting 
depending on whether f{x k+ i) > t/o> drawing more crumbs and adapting until a proposal is accepted. 



4.2 Estimating the Density at the Mode 

We now modify the method to remove the restriction that the target distribution be unimodal and remove 
the requirement that we precompute the log density at the mode. Estimating M each time we update 
Xq instead of precomputing it allows M to take on values appropriate to local modes. Since the proposal 
distribution only approximates the slice even in the best of circumstances, it is not essential that the estimate 
of M be particularly good. 

To estimate M, we initialize M to the slice level, logyoi before drawing the first crumb. Then, every 
time we fit the parabola described by equation [7J we update M to be the maximum of the current value of 
M and the estimated peak of the parabola. As more crumbs are drawn, M becomes a better estimate of 
the local maximum. We could also use the values of the log density at rejected proposals and at the {u k } to 
bound M, but if the log density is locally concave, the log densities at the peaks of the parabola will always 
be larger than these values. 



4.3 Efficient Computation of the Crumb Covariance 

This section describes a method for using Cholesky factors of precision matrices to make implementation of 



the covariance-matching method of section 4.1 efficient. If implemented naively, the method of section 4.1 



would use 0(p 3 ) operations when drawing a proposal with equations [4] and [6j We would like to avoid this. 
One way is to represent W k and A k by their upper-triangular Cholesky factors F k and R k , where F k F k — W k 
and R k Rk = A k . 

First, we must draw proposals efficiently. If z% and z-i are p-vectors of standard normal variates, we can 
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replace the crumb and proposal draws of equations 24 and [4] with 



Cfe = x Q + F k 1 zi 
x k =c k + R k ~ 1 z 2 



(25) 
(26) 



Since Cholesky factors are upper-triangular, evaluation of F^ 1 z-y and R k 1 Z2 by backward substitution takes 
0(p 2 ) operations. 

We must also update the Cholesky factors efficiently. We replace the updates of W k and A k in equations [21] 
and [18] with: 



F k+1 = chud(VdR k , y/ag) 
R k+ i = chud( Vl + 0R k , y/ag) 



(27) 
(28) 



Here, chud(i?, v) is the Cholesky factor of R T R ■ 



■ vv 



The function name is an abbreviation for "Cholesky 



update." It can be computed with the LINPACK routine DCHUD, which uses Givens rotations to compute 
the update in 0(p 2 ) operations (Dongarra et al. 1979 ch. 10). 



Finally, we would like to compute the proposal mean efficiently. We do this by keeping a running sum of 
the un- normalized crumb mean (the parenthesized expression in equation [6]) , which we will represent by . 
Define: 



c* = Wici + • • • + W k c k 
= c* k _ x + W k c k 



-k-\ 



F k F k c k 



(29) 
(30) 
(31) 



Then, using forward and backward substitution, we can compute the normalized crumb mean, c k , as: 

c k - R^R^cl (32) 

This way, we can compute c k in 0{p 2 ) operations and do not need to save all the crumbs and crumb 
covariances. 

With these changes, the resulting algorithm is numerically stable even with ill-conditioned target dis- 
tributions. Each crumb and proposal draw takes 0{p 2 ) operations. Figure ^ shows pseudocode for this 
method. 



5 Second Method: Shrinking the Rank of the Crumb Covariance 

The method of section [4] attempts to adapt the crumb distribution so that the proposal distribution matches 
the shape of the slice. However, it often can't due to positive-definiteness constraints, requiring the max{-, •} 
operation in equation |23| Even when it can perform the adaptation, it may not be appropriate if the 
underlying distribution is not approximately Gaussian. 

This section describes a different method, also based on the framework of section [3] Instead of attempting 
to match the conditional variance in the direction of the gradient, it just sets it to zero. This is reasonable 
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Slice Sampling with Covariance Matching 



Initialize N, /, x 6 M p , a c , and 6. 
X<-() 

repeat N times: 
M^log/(:ro) 

e <— draw fromExponcntial(l) 
j/o <- M - e 
i? <- a- x l 
F <- a- 1 / 
c* <- 

repeat until a proposal is accepted: 
z «— draw from iVp(0, /) 
c «- .t + -F 1-1 ^ 
c* ^ c* + F T i^c 
c«- R- 1 R- T c' 
z <— draw from iV p (0, 7) 
x -s— c + R~ x z 
V *- log /(a;) 
if y > j/o: 

proposal accepted, break 
end (if) 

G <- Vlog/(x) 
S<-G/||G|| 
5 <- \\x - c\\ 
u <— x + Sg 
l u <- log/(u) 

fs«--2<r 2 (* u - jf-<5||G||) 

M <- max{M, 4,„} 

a <- max {0, cr" 2 - (1 + 6)g T R T Rg) 
F <- chud (VftR, vW) 
i? <- chud (VTT9R, Jag) 
end (repeat) 
append x to X. 
Xo 4— x 
end (repeat) 
return X 



Figure 4: This figure shows the adaptive algorithm of section [4] The variables are mostly the same as in the 
text, with a few exceptions: To avoid underflow, we set y = logy and j/o = logyo! see Neal (2003 §4) for 
discussion. The k subscript is dropped since there is no need to keep copies of the values from any but the 
most recent iteration. The variable £ x . u indicates the peak of the parabolic cut through x and u; N indicates 
the number of samples to draw. The generated samples are stored in the ordered set X. 
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in that the gradient at a proposal probably points in a direction where the variance is small, so it is more 
efficient to move in a different direction. 

With this method, the crumb covariance is zero in some directions and spherically symmetric in the rest, 
so its simplest representation is the pair (w, J), where J is a matrix of orthonormal columns of directions in 
which the conditional variance is zero, and w 2 is the variance in the other directions. 

Define P(J, v) to be the function that returns the component of vector v orthogonal to the columns of J: 

!v — JJ T v if J has at least one column 
(33) 
v if J has no columns 

For simplicity of computation, since the crumbs are located in a common subspace with xq, this method will 
consider the origin to be at xq except when calling /(•), which is provided by the user, and when returning 
samples to the user. Each crumb is drawn by: 

Cfc = cr c P(J, z) where z ~ N P (Q, I) (34) 

When the first crumb is drawn, J has no columns, so P(J, z) — z and the first crumb has the distribution 
specified in section [3j 

Given k crumbs and the covariances of their distributions, we know that xq must lie in the intersection 
of the subspaces of their covariances. Since the subspace Cj is drawn from contains the subspace c k is 
drawn from when k > j, this is equivalent to saying that x$ must lie in the subspace of Cfc's covariance, the 
orthogonal complement of J. So, the precision of the posterior for xq in the direction of columns of J is 
infinite. It is equal to ka~ 2 in all other directions, since there are k crumbs, each with spherical variance 
equal to a 2 in the subspace they were drawn in. The mean of the proposal distribution (with origin Xq) is 
the projection of: 

cr7 2 °l-\ 1- gc^gfc 

k(Jc 2 

= k- 1 (c 1 + --- + c k ) (35) 

Therefore, to draw a proposal, we draw a vector of standard normal variates, project it into the orthogonal 
complement of J, scale by a c /Vk, and add c, also projected into the orthogonal complement of J. With the 
original origin, the proposal is: 

x k = x + P(J, c) + (<7 C /Vk.) ■ P{J, z) where z - N p {0, 1) (36) 

If the proposal is in the slice (that is, if f(xk) > yo), we accept it. Otherwise, we adapt the crumb distribution. 
If J has p — 1 columns, we can't add any more without the crumb covariance having a rank of zero, so we 
do not adapt in that case. Otherwise, we add a single new column in the direction of Vlog/(a3fe), projected 
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into the directions not already spanned by J, which we denote by g* . Thus, the new value of J would be 



J, 



fc+i 



Jk 



where g* = P ( J k , V log f(x k )) 



(37) 
(38) 



To prevent meaningless adaptations, we only perform this operation when the angle between the gradient 
and its projection into the nullspace of J is less than 60°. Equivalently, we only adapt when: 



g* r Vlog/(a; fc ) 
|| || V log /(a*) || 



> 



(39) 



After possibly updating J, we draw another crumb and proposal, repeating until a proposal is accepted. The 
method of this section is summarized in figure [5| 

A variation on the method (which we use in the implementation tested in section [7]) shrinks the crumb 
standard deviation in the nonzero directions, a c , by a constant factor each time a crumb is drawn; section 
[7] uses 0.9. This results in exponentially falling proposal variance, which, as described in section |4~Tj allows 
large initial crumb variances to be used. 



6 Procedure for Evaluation 

Figures [6] and [7] demonstrate the performance of the methods described in this document. Figure [6] demon- 
strates the performance of four samplers on a Gaussian distribution (to be described in section [7]). It contains 
two graphs, one for each of two figures of merit. The top graph plots log density function evaluations per 
independent sample against a tuning parameter. The bottom graph plots processor-seconds per independent 
sample against a tuning parameter. For both figures of merit, smaller is better. 

Each of the four panes in each graph contains data from a single sampler, with each point representing 
a run with a specific tuning parameter. The tuning parameter for each sampler has the same scale; the 
sampler initially attempts to take steps roughly that size. 

Both figures of merit require us to determine the correlation length, the number of correlated samples 
equivalent to an independent sample. For the runs done here, an AR(1) model captures the necessary 
structure. For each parameter, we fit the following model: 

X t = E(X t )+TcX t -i+at (40) 



where o t is a noise process. Then, the number of samples equivalent to a single independent sample is: 



var(af) 



var(A t ) ■ (1 - iif 



(41) 



This formula is based on the ef f ectiveSize function in CODA (Plummer et al. 2006), which uses the 



spectral approach of Heidelberger and Welch ( 1981 ). Wei ( 2006 pp. 276-278) has a more in-depth discussion 



of the spectrum of AR processes. To estimate a correlation length for a multivariate distribution, we take 
the largest estimated correlation length of each of its parameters. This is not valid in general, but is an 
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Crumbs with Shrinking-Rank Covariance 



Initialize N, /, x € 1R P , and a c . 
X<-() 



repeat N times: 

e <— draw fromExponential(l) 

Vo <- log/(x ) - e 

J<-[] 

c <- 

repeat until a proposal is accepted: 
k <- k + 1 

z <— draw from N p (0, 1) 
c 4— a c z 

c <- fc _1 ((fc - l)c+c) 
z draw fromJV p (0, /) 



<?* «-P(J,Vlog/(a;)) 
if 5 * T Vlog/(:r)> A|| 5 *|| ||Vlog/(x)||: 

J^[J 9*/\\9*\\} 
end (if) 

end (if) 



end (repeat) 
append x to X. 

Xq <— X 

end (repeat) 
return X 



Figure 5: The adaptive algorithm of section [5j This differs from that section in that the projection into the 
nullspace of J is delayed until drawing a proposal, reducing the number of matrix products computed. 




if log/ (x) > y : 



proposal accepted, break 
end (if) 

if J has less than p — 1 columns: 
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acceptable approximation for this experiment. 

Most chains are displayed with a circle indicating an estimate of the figure of merit, with a line indicating 
a nominal 95% confidence interval. The intervals are based on normality of the increments of the AR(1) 
process, so they should be viewed as a lower bound on the uncertainty of the point estimates. Chains whose 
figures are plotted as question marks were estimated as having an effective sample size of less than four. 

Figure [7] contains information on four different samplers. The panes of figure [7] are similar to those of 
figure [6j and each is labeled with the distribution and the sampler the chains in that pane come from. The 
columns of panes correspond to samplers; the rows of panes correspond to distributions. By reading across, 
one can see how different samplers perform on a given distribution. By reading down, one can see how the 
performance of a given sampler varies from distribution to distribution. 

Every chain has 150,000 samples. In general, the chain length does not affect the results; we will point 
out exceptions to this. 

7 Results of Evaluation 

We simulated four samplers on four distributions for twelve different tuning parameters each. The samplers 
we considered are: 

• Covariance-Matching: This is the method described in section [4j The tuning parameter is a c . 

• Shrinking-Rank: This is the method described in section [5] The tuning parameter is er c . 

• Non- Adaptive Crumbs: This is a non-adaptive variant of the general method of section [3] It is like the 
method of section [5j but never shrinks rank. However, like that method, it scales the crumb standard 
deviation down by 0.9 after each proposal. The tuning parameter is <r c . 

• Metropolis (with Trials): This method is a Metropolis sampler that uses trial runs to automatically 
determine a suitable Gaussian proposal distribution. The tuning parameter specifies the standard 
deviation of the proposal distribution for the first trial run. 

The distributions we considered are: 

• N<i(p — 0.999): This is a four-dimensional Gaussian with highly-correlated parameters. The marginal 
variances for each parameter are one; each parameter has a correlation of 0.999 with each other pa- 
rameter. The covariance matrix has a condition number of about 2900 and is not diagonal. 

• N^(p = —0.3329): This is a four-dimensional Gaussian with negatively-correlated parameters. The 
marginal variances for each parameter are one, and each parameter has a correlation of —0.3329 with 
each other parameter. The covariance has a condition number of about 2900, like N±{p — 0.999), but 
instead of one large eigenvalue and three small ones, this distribution has one small eigenvalue and 
three large ones. 

• Eight Schools: This is a multilevel model in ten dimensions, consisting of eight group means and 



hyperparameters for their mean and variance. It comes from Gelman et al. (2004 pp. 138-145) 
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Comparison of density evaluations on N[4](rho=0.999) 



Covariance-Matching 



Shrinking-Rank 



Non-Adaptive Crumbs Metropolis (with Trials) 
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Figure 6: The performance of four MCMC samplers on N±(p = 0.999). See section [6] for a description of the 
graphs and section [7] for discussion. 



• Ten-Component Mixture: This is a ten-component Gaussian mixture in R 10 . Each mode is a spherically 
symmetric Gaussian with unit variance. The modes are uniformly distributed on a hypercube with 
edge-length ten. 

By comparing the top and bottom graphs of figure [6| we see that for N^p = 0.999), processor time 
and number of density evaluations tell the same story. The plots of function evaluations and the plots of 
processor time are nearly identical except for their vertical scales. This is true of the other distributions as 
well, so we do not repeat the processor-time plots for the others. 

Figure [6] (and the identical first row of figure [7]) shows the performance of the four methods on a highly- 
correlated four-dimensional Gaussian, N^(p = 0.999). Both adaptive slice sampling methods perform well 
once the tuning parameter is at least the same order as the standard deviation of the target distribution. 
This hockey-stick performance curve is characteristic of slice sampling methods. The non-adaptive slice 
sampling method always takes steps of the order of the smallest eigenvalue once its tuning parameter is at 
least that large, so its performance is bad, but after this threshold, its performance does not depend much 
on the tuning parameter. The Metropolis sampler also fails to capture the shape of the distribution, a result 
that depends more on chain length. Had the chain been longer, the preliminary runs the sampler uses to 
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Figure 7: The performance of four MCMC samplers on four distributions. See section [6] for a description of 
the graph and section [7] for discussion. 
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estimate a proposal distribution might have worked better, leading to average performance comparable to 
the adaptive slice samplers. 

The second row of figure [7] shows sampler performance on a similar but negatively correlated four- 
dimensional Gaussian, N±(p = —0.3329). The adaptive slice sampling methods continue to perform well on 
this distribution, and the non- adaptive sampler improves somewhat. The Metropolis sampler improves a 
great deal; on this target distribution, it is able to choose a reasonable proposal distribution, so it performs 
comparably to the adaptive slice sampling methods. 

The third row of figure [7] shows sampler performance on Eight Schools. The minimum threshold appears 
again for both adaptive slice samplers as well as the non-adaptive slice sampler. Since the condition number of 
the covariance of this distribution is only about seven, adaptivity is not as important, though slice sampling's 
robustness to improper tuning parameters remains important. The adaptive Metropolis method again fails 
to identify a reasonable proposal distribution for small tuning parameters, and fails to generate any proposal 
distribution at all for large ones. This is partially a reflection on this particular implementation, which only 
tries preliminary runs with proposal distribution standard deviations within four orders of magnitude of the 
tuning parameter. 

The bottom row of figure [7J which shows performance on Ten-Component Mixture, has a similar pattern. 
The results are more erratic since the distribution has multiple, moderately-separated modes, and none of 
the samplers are designed to perform well on multimodal distributions. 

8 Discussion 

The adaptive slice sampling methods of sections [4] and [5] generally perform at least as well as non-adaptive 
slice sampling methods and Metropolis. Slice sampling in general tends to be more robust to imperfect 
choice of tuning parameters than Metropolis. Preliminary chains are usually unnecessary, avoiding the 
hassle of manual management of these runs or the idiosyncratic performance of automatic evaluation of the 
runs. The main disadvantage of the adaptive slice samplers relative to Metropolis and non-adaptive slice 
sampling is that they require the log density to have an analytically computable gradient, though this is a 
standard requirement in numerical optimization, and experience in that domain has shown that computing 
the gradient is often straightforward. 

The two adaptive slice sampling methods tend to perform similarly to each other. The shrinking-rank 
method usually performs slightly better, but this advantage can be mitigated by making the approximation 
log/(ttfc) rj log yo- There is no theoretical justification for this, but it cuts the number of log density 
evaluations by half with negligible performance cost, making the performance of the two adaptive methods 
indistinguishable. The shrinking-rank method is simpler, though, and requires only 0(min(fc,p)-p) operations 
to draw the fcth crumb and proposal, slightly better than the 0(p 2 ) operations needed by the covariance- 
matching method. 

Like most variations of multivariate slice sampling, both the non-adaptive and adaptive methods described 
here do not work well for target distributions in spaces higher than a few dozen. The variation in log density 
of samples increases with dimension, but slice sampling takes steps in the log density of only order one each 
iteration. So, in high-dimensional spaces, samples tend to be highly correlated. 

Due to this poor scaling with dimensionality, the methods of this document have limited usefulness on 
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their own. They may be useful in highly-correlated low-dimensional problems, though, and can be used to 
take steps in highly-correlated low-dimensional subspaces as part of larger sampling schemes. We hope to 



address this limitation in future work, possibly with polar slice sampling (Roberts and Rosenthal 2002) 



An R implementation of these methods is available at htt p : //www . cs . utoronto . ca/~radf ord| 
References 

Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. (1979). UNPACK User's Guide. Society 
for Industrial and Applied Mathematics. 

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis, Second Edition. 
Chapman and Hall/CRC. 

Grimm, J. and Grimm, W. (2008). Hansel and Gretel. In Grimm's Fairy Tales. Gutenberg Project. 

Heidelberger, P. and Welch, P. D. (1981). A spectral method for confidence interval generation and run 
length control in simulations. Communications of the ACM, 24(4):233-245. 

Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31:705-767. 

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, Second Edition. Springer. 

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: Convergence diagnosis and output analysis 
for MCMC. R News, 6(1):7-11. 

Roberts, G. O. and Rosenthal, J. S. (2002). The polar slice sampler. Stochastic Models, 18(2):257-280. 

Wei, W. W. S. (2006). Time Series Analysis: Univariate and Multivariate Methods, Second Edition. Pear- 
son/ Addison Wesley. 



17 



